The PDB is the main repository of biomolecular structure data.
Here we grab the current composition statistics from the web page: https://www.rcsb.org/stats/summary
tbl <- read.csv("Data Export Summary.csv", row.names = 1)
tbl
##                          X.ray   NMR   EM Multiple.methods Neutron Other  Total
## Protein (only)          144433 11881 6732              182      70    32 163330
## Protein/Oligosaccharide   8543    31 1125                5       0     0   9704
## Protein/NA                7621   274 2165                3       0     0  10063
## Nucleic acid (only)       2396  1399   61                8       2     1   3867
## Other                      150    31    3                0       0     0    184
## Oligosaccharide (only)      11     6    0                1       0     4     22
Question 1: What percentage of structures in the PDB are solved by X-ray and electron microscopy (EM)?
#Check the sums of all the columns in the data set
#colSums(tbl)
#Sum the relevant columns and divide that number by the sum of the "total" column, multiplying the answer by 100 to achieve a percentage
n.type <- colSums(tbl)
n.type / n.type["Total"] * 100
##            X.ray              NMR               EM Multiple.methods
##      87.16888390       7.27787573       5.38868408       0.10632046
##          Neutron            Other            Total
##       0.03846770       0.01976813     100.00000000
#To answer the question with the above method, we would store n.type / n.type["Total"] * 100 in a variable and then type 'r variable[1]' to output the X-ray percentage and 'r variable[3]' to output the EM percentage, since the result is a vector of several values at fixed positions
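The approach described above can be sketched as follows. Because the CSV file is not available here, the table is rebuilt by hand from the printed output, and the variable name `pct` is an assumption chosen for illustration. Indexing by name rather than by position is a small variation on the `variable[1]` idea.

```r
# Rebuild the summary table from the printed output above
# (a stand-in for read.csv("Data Export Summary.csv", row.names = 1)).
tbl <- data.frame(
  X.ray            = c(144433, 8543, 7621, 2396, 150, 11),
  NMR              = c(11881, 31, 274, 1399, 31, 6),
  EM               = c(6732, 1125, 2165, 61, 3, 0),
  Multiple.methods = c(182, 5, 3, 8, 0, 1),
  Neutron          = c(70, 0, 0, 2, 0, 0),
  Other            = c(32, 0, 0, 1, 0, 4),
  Total            = c(163330, 9704, 10063, 3867, 184, 22),
  row.names = c("Protein (only)", "Protein/Oligosaccharide", "Protein/NA",
                "Nucleic acid (only)", "Other", "Oligosaccharide (only)")
)

# Column sums, then each column as a percentage of the Total column
n.type <- colSums(tbl)
pct <- n.type / n.type["Total"] * 100   # 'pct' is a hypothetical name

# Pull out the two values of interest by name and round to 3 decimals
round(pct[c("X.ray", "EM")], 3)
```

With the values stored in `pct`, the inline forms 'r round(pct["X.ray"], 3)' and 'r round(pct["EM"], 3)' would place the two percentages directly in the text.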
#The less elegant way I came up with
XR <- sum(tbl[,1]) / sum(tbl[,7]) * 100
EM <- sum(tbl[,3]) / sum(tbl[,7]) * 100
XR
## [1] 87.16888
EM
## [1] 5.388684
#How do we get an output with only 3 decimal places?
XRr <- round(XR, digits = 3)
XRr
## [1] 87.169
EMr <- round(EM, digits = 3)
EMr
## [1] 5.389
The proportion of X-ray structures is 87.169% of the total structures.
The proportion of EM structures is 5.389% of the total structures.
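As a possible alternative to rounding first and typing the sentence by hand, the sentences can be built with `sprintf()`, which always prints the requested number of decimals (whereas `round()` followed by printing drops trailing zeros). This is only a sketch: the counts below are the X-ray, EM, and Total column sums from the table above, and the names `msg_xr` and `msg_em` are made up for the example.

```r
# Column sums taken from the table above
xray_n  <- 163154   # sum of the X.ray column
em_n    <- 10086    # sum of the EM column
total_n <- 187170   # sum of the Total column

XR <- xray_n / total_n * 100
EM <- em_n / total_n * 100

# "%.3f" formats with exactly 3 decimal places; "%%" prints a literal percent sign
msg_xr <- sprintf("The proportion of X-ray structures is %.3f%% of the total structures.", XR)
msg_em <- sprintf("The proportion of EM structures is %.3f%% of the total structures.", EM)

cat(msg_xr, msg_em, sep = "\n")
```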
Question 2: What proportion of structures in the PDB are protein?
#Take the total number of protein entries (located in row 1, column 7) and divide it by the sum of the total column
Prot <- round(tbl[1,7] / sum(tbl[,7]) * 100, digits = 3)
Prot
## [1] 87.263
#Barry's more elegant solution
#tbl$Total[1]
#This way you only need to know the column name, not its number, and you can still specify the row you want to access
#It also protects you if the database layout changes, since the 'Total' column is found by name regardless of its position
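The name-based idea can be taken one step further by naming the row as well as the column. The following is a minimal sketch, assuming the row label is exactly "Protein (only)" as in the printed table; only the Total column is rebuilt here, and `prot_pct` is a hypothetical variable name.

```r
# Only the Total column is needed for this question; values copied from the table above
tbl <- data.frame(
  Total = c(163330, 9704, 10063, 3867, 184, 22),
  row.names = c("Protein (only)", "Protein/Oligosaccharide", "Protein/NA",
                "Nucleic acid (only)", "Other", "Oligosaccharide (only)")
)

# Select the cell by row name and column name, so neither position matters
prot_pct <- tbl["Protein (only)", "Total"] / sum(tbl$Total) * 100
round(prot_pct, 3)
```

This gives the same 87.263 as the positional version, and it keeps working if rows or columns are reordered in a later export, provided the labels stay the same.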
The proportion of entries that are protein structures is 87.263%
Question 3: Type HIV in the PDB website search box on the home page and determine how many HIV-1 protease structures are in the current PDB?